Personal Loan Campaign Modelling Project

by: Garey Salinas

Description

Background and Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise better-targeted campaigns to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that helps the marketing department identify the customers who are most likely to purchase the loan.

Objective

  1. Predict whether a liability customer will buy a personal loan.
  2. Identify which variables are most significant.
  3. Identify which customer segments should be targeted more.

Data Dictionary

| Label | Description |
|---|---|
| ID | Customer ID |
| Age | Customer’s age in completed years |
| Experience | Years of professional experience |
| Income | Annual income of the customer (in thousand dollars) |
| ZIP Code | Home address ZIP code |
| Family | Family size of the customer |
| CCAvg | Average spending on credit cards per month (in thousand dollars) |
| Education | Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional |
| Mortgage | Value of house mortgage, if any (in thousand dollars) |
| Personal_Loan | Did this customer accept the personal loan offered in the last campaign? |
| Securities_Account | Does the customer have a securities account with the bank? |
| CD_Account | Does the customer have a certificate of deposit (CD) account with the bank? |
| Online | Does the customer use internet banking facilities? |
| CreditCard | Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? |

Import libraries and load dataset

Import libraries

Read Dataset

Overview of Dataset

Edit column names
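
The later sections refer to columns by lowercase names such as zipcode, ccavg and credit_card, so the renaming step might look like the sketch below. The `RENAMES` mapping and the `clean_column_names` helper are hypothetical, inferred from the data dictionary and the section titles, and the empty frame only stands in for the real dataset.

```python
import pandas as pd

# Hypothetical special cases, inferred from the section names used later on.
RENAMES = {"zip code": "zipcode", "creditcard": "credit_card"}


def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with stripped, lowercase column names."""
    out = df.copy()
    lowered = [col.strip().lower() for col in out.columns]
    out.columns = [RENAMES.get(col, col) for col in lowered]
    return out


# Empty frame carrying a few of the labels from the data dictionary.
raw = pd.DataFrame(columns=["ID", "ZIP Code", "CCAvg", "Personal_Loan", "CreditCard"])
cleaned = clean_column_names(raw)
print(list(cleaned.columns))
```

Returning a copy keeps the raw frame available for comparison while the notebook is still being explored.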

Observation

Check for duplicates

Describe dataset

Observations

Change dtypes

Observations

Observations

Observations

Exploratory Data Analysis

Univariate Analysis

Observations on age

Observations

Observations on income

Observations

Observations on income outliers

Observations on ccavg

Observations

Observations on ccavg outliers

Observations on mortgage

Observations

Observations on mortgage outliers

Check zero values in mortgage column

Check zipcode frequency where mortgage equals zero

Observations

Observations on experience

Observations

Observations

Countplot for experience less than zero vs. age.

Observations

Taking absolute values of the experience column
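
A minimal sketch of this step, assuming the negative experience values are sign errors rather than missing data. The three-row frame is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy rows standing in for the real data; the values are illustrative only.
df = pd.DataFrame({"age": [24, 25, 45], "experience": [-1, -2, 20]})

# Count the negative entries first so the size of the fix is on record.
n_negative = int((df["experience"] < 0).sum())

# Replace each value with its magnitude; non-negative values are unchanged.
df["experience"] = df["experience"].abs()

print(n_negative, df["experience"].tolist())
```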

Observations

Overview of the distributions of numerical columns

Overview of the dispersion of numerical columns

Display value counts from categorical columns

Observations on zipcode

Observations

Observations on family

Observations

Observations on education

Observations

Observations on personal_loan

Observations

Observations on securities_account

Observations

Observations on cd_account

Observations

Observations on online

Observations

Observations on credit_card

Observations

Bivariate Analysis

Correlation and heatmap

Observations

Observations

Show without outliers in boxplots

Observations

personal_loan vs family

Observations

personal_loan vs education

Observations

personal_loan vs securities_account

Observations

personal_loan vs cd_account

Observations

personal_loan vs online

Observations

personal_loan vs credit_card

Observations

cd_account vs family

Observations

cd_account vs education

Observations

Observations

cd_account vs securities_account

Observations

cd_account vs online

Observations

cd_account vs credit_card

Observations

Let us check which of these differences are statistically significant.

The chi-square test of independence is a statistical method for determining whether two categorical variables are significantly associated.

$H_0$: There is no association between the two variables.
$H_a$: There is an association between the two variables.
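
The test above can be sketched with `scipy.stats.chi2_contingency` applied to a contingency table. The counts below are invented for illustration, and the 0.05 significance level is an assumption; the source does not state which level was used.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented toy data: 100 customers, half of them with a CD account.
toy = pd.DataFrame({
    "cd_account": [0] * 50 + [1] * 50,
    "personal_loan": [0] * 40 + [1] * 10 + [0] * 25 + [1] * 25,
})

# Rows are cd_account levels, columns are personal_loan levels.
table = pd.crosstab(toy["cd_account"], toy["personal_loan"])

# For a 2x2 table SciPy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(table.to_numpy())

ALPHA = 0.05  # assumed significance level
reject_h0 = p_value < ALPHA
print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}, reject H0: {reject_h0}")
```

Rejecting $H_0$ says only that the two variables are associated; it says nothing about the direction or strength of the association.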

Key Observations

Build Model, Train and Evaluate

  1. Data preparation
  2. Partition the data into training and test sets.
  3. Build a CART model on the training data.
  4. Tune the model and prune the tree, if required.
  5. Evaluate the model on the test set.

Partition Data
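
One way to partition the data is a stratified split, which keeps the share of loan buyers similar in both sets. This is a sketch on synthetic data; the 70/30 ratio, the seed and the two feature columns are assumptions, not values reported in the source.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data, with roughly 10% positives.
rng = np.random.default_rng(1)
n = 1000
X = pd.DataFrame({
    "income": rng.integers(8, 225, n),
    "family": rng.integers(1, 5, n),
})
y = pd.Series((rng.random(n) < 0.1).astype(int), name="personal_loan")

# stratify=y preserves the class ratio, which matters for a rare target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```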

Build Initial Decision Tree Model

Observations

Recall score from baseline model.
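
A baseline of this kind can be sketched as an unconstrained `DecisionTreeClassifier` scored with recall, since a missed buyer (false negative) is the costly error for this campaign. The data here is synthetic via `make_classification`, so the numbers it prints say nothing about the actual loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan data (about 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# No depth or leaf-size limits: the tree grows until its leaves are pure.
baseline = DecisionTreeClassifier(criterion="gini", random_state=1)
baseline.fit(X_train, y_train)

# Recall = true positives / (true positives + false negatives).
train_recall = recall_score(y_train, baseline.predict(X_train))
test_recall = recall_score(y_test, baseline.predict(X_test))
cm = confusion_matrix(y_test, baseline.predict(X_test))

print(f"train recall={train_recall:.3f}, test recall={test_recall:.3f}")
print(cm)
```

A large gap between training and test recall is the usual sign that the unpruned tree is overfitting, which motivates the tuning and pruning steps.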

Visualizing the decision tree from baseline model

Feature importance from baseline model
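
Feature importances can be read from the fitted tree's `feature_importances_` attribute. The sketch below uses synthetic data and placeholder feature names, so the ranking it prints is illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder names; the real notebook would use the dataset's columns.
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=1
)

tree = DecisionTreeClassifier(random_state=1).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(
    tree.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
```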

Using GridSearchCV for hyperparameter tuning of our tree model
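
A minimal sketch of such a search, assuming recall is the metric being optimized. The parameter grid, the 5-fold setting and the synthetic data are illustrative choices, not the ones used in the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan data.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Hypothetical grid over two pre-pruning parameters.
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 10],
}

# Each combination is scored by cross-validated recall on the training set.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)

# With the default refit=True, this is the best tree refit on all of X_train.
best_tree = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Because the search only ever sees the training set, the test set stays untouched for the final evaluation.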

Confusion matrix using GridSearchCV

Recall score using GridSearchCV

Visualizing the decision tree from the best fit estimator using GridSearchCV

Feature importance using GridSearchCV

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.
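
As a quick illustration of the effect described above, the sketch below fits the same tree with a few hand-picked ccp_alpha values and records the node count of each. The alpha values and the synthetic data are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=1
)

# Arbitrary alphas in increasing order; 0.0 means no pruning at all.
alphas = [0.0, 0.005, 0.02, 0.1]
node_counts = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y)
    node_counts.append(tree.tree_.node_count)

for alpha, count in zip(alphas, node_counts):
    print(f"ccp_alpha={alpha}: {count} nodes")
```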

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
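
The pruning path and the choice of alpha can be sketched as follows: compute the effective alphas, fit one tree per alpha, and keep the alpha with the best recall on held-out data. The data is synthetic, and scoring on a separate validation split is an assumption; the source does not say which split was used for the selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=1,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Effective alphas and the total leaf impurity at each pruning step.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Clipping guards against tiny negative alphas caused by rounding error.
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)
impurities = path.impurities

# Fit one pruned tree per alpha and score it on the held-out split.
val_recalls = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    val_recalls.append(recall_score(y_val, tree.predict(X_val)))

best_alpha = ccp_alphas[int(np.argmax(val_recalls))]
print(f"best ccp_alpha={best_alpha:.5f}, recall={max(val_recalls):.3f}")
```

The largest alpha on the path typically prunes the tree down to its root, so it is normal for recall to fall to zero at the end of the path.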

Visualizing the Decision Tree

The decision tree model with post-pruning gave the best recall score on the data.

Conclusion

Recommendations